TCGA Pan-cancer analyses
TCGA Pan-cancer analyses
This part will clearly describe how to analyze TCGA Pan-cancer data. Raw data used for TCGA pancan analyses have been preprocessed, detail please read preprocessing part of this analysis report.
Clean and combine data
Although data have been preprocessed in preprocessing part, they are still necessary to do some clean before really analyzing them according to our purpose.
library(tidyverse)
load("results/gsva_tcga_pancan.RData")
load("results/TCGA_tidy_Clinical.RData")
df.gsva = full_join(TCGA_Clinical.tidy, gsva.pac, by=c("Tumor_Sample_Barcode"="tsb"))Only keep tumor samples, and filter sample which sample type is “0” or “X”. Number of samples with these two sample type are very few.
df.gsva.tumor = df.gsva %>%
filter(sample_type == "Primary Tumor", !Tumor_stage%in%c("0", "X")) %>%
mutate(Tumor_stage = factor(Tumor_stage, levels = c("I", "II", "III", "IV")))Totally, 9095 tumor samples have APS value.
Strong association between APS and immune cell infiltration level
Association between GSVA scores
We know that genes used for APS calculation does not overlap with genes of immune cell type, but we don’t know if there are association between APS and them. Besides, we also don’t know if there are association between APS and two aggregate scores: TIS and IIS.
The first step we want to do is show the relationships between scores using a heatmap. The correlation calculation is based on spearman method.
df.gsva.heat = df.gsva.tumor %>%
select(-c(Tumor_Sample_Barcode:Tumor_stage)) %>%
filter(!is.na(APM)) %>% select(Project, APM, everything())
heat_mat = sapply(unique(df.gsva.heat$Project), function(x){
mat = filter(df.gsva.heat, Project == x)
mat = mat[, -1]
cor_mat = cor(mat, method = "spearman")
cor_mat[,1]
})
heat_mat = heat_mat[-1, ]
## If you dont want plot TIS and IIS here, please uncomment following code
#heat_mat = heat_mat[!rownames(heat_mat) %in% c("TIS", "IIS"), ]
library(pheatmap)
breaksList = seq(-1, 1, by = 0.01)
clust = pheatmap(heat_mat,
color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
breaks = breaksList)annotation_col = data.frame(
ProjectGroup = factor(paste0('ColCluster',cutree(clust$tree_col,3)))
)
annotation_row = data.frame(
ImmuneCellGroup = factor(paste0('RowCluster',cutree(clust$tree_row,2)))
)
rownames(annotation_col) = colnames(heat_mat)
rownames(annotation_row) = rownames(heat_mat)
pheatmap(heat_mat, color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
breaks = breaksList,
annotation_row = annotation_row,
annotation_names_row = FALSE,
fontsize_row = 8, fontsize_col = 6,
cellheight = 10, cellwidth = 10)The two heatmaps show that there is strong positive/consistent relationship between APM and immune/T cell infiltration level across TCGA cancer types. These spearman correlation coefficient are showed as a table.
DT::datatable(heat_mat,
options = list(scrollX = TRUE, keys = TRUE), rownames = TRUE)IIS is a good representation of immune cell infiltration
Immune/T cell infiltration is a good indicator for ICB response. Here we have seen that APS have strong association with both TIS and IIS. In indivial level, i.e. 9095 samples, we also observe that TIS and IIS are basically same because their correlation coeficient is about 0.91!
df.gsva.tumor %>% filter(!is.na(IIS)) %>%
summarise(corr = cor(TIS, IIS, method = "spearman"))
#> # A tibble: 1 x 1
#> corr
#> <dbl>
#> 1 0.909Therefore, we should choose one of them for downstream analysis. Finally, we choose IIS not only because it is more comprehensive, but also it is well validated.
- In vitro validation with multiplex immunofluorescence, in silico validation using simulated mixing proportions and comparison between CIBERSORT and IIS have been previously described (Senbabaoglu, Y. et al).
TIMER is another method that can accurately resolve relative fractions of diverse cell types based on gene expression profiles from complex tissues. To further validate the calculated IIS, we perform TIMER analysis and find that the result of TIMER is highly correlated with the calculated IIS.
load("results/timer.RData")
df = dplyr::select(df.gsva.tumor, Project, APM, TIS, IIS, Gender, Age, Tumor_stage, OS.time, OS, Tumor_Sample_Barcode)
df_timer = dplyr::left_join(x = df, y = timer_clean, by = c("Tumor_Sample_Barcode"="sample"))library(corrplot)
mat_timer_IIS = df_timer %>%
select(-c(APM, TIS, Gender:Tumor_Sample_Barcode)) %>%
filter(!is.na(IIS) & !is.na(T_cell.CD8)) %>%
filter(T_cell.CD8 != 0)
mat_timer_IIS.heat = sapply(unique(mat_timer_IIS$Project), function(x){
mat = filter(mat_timer_IIS, Project == x)
mat = mat[, -1]
cor_mat = cor(mat, method = "spearman")
cor_mat[,1]
})
mat_timer_IIS.heat = mat_timer_IIS.heat[-1, -22]
p.mat = sapply(unique(mat_timer_IIS$Project), function(x){
mat = filter(mat_timer_IIS, Project == x)
mat = mat[, -1]
tryCatch(expr = {
p_mat = cor.mtest(mat, method = "spearman", conf.level = .95)
p_mat[[1]][,1]
}, error = function(e){
rep(NA, 7)
})
})
p.mat = p.mat[-1, -22]
col = colorRampPalette(c("blue", "white", "red"))(200)
corrplot(mat_timer_IIS.heat, method = "color", tl.col="black", tl.srt = 45,
col = col, p.mat = p.mat, sig.level = 0.05)The color bar in figure shows correlation coefficient values. The “X” marks relationship does not pass significant test (i.e. p>=0.05).
Exploration of APS, TMB, TIGS at pan-cancer level
Load TCGA TMB data and merge all necessary datasets.
load("results/TCGA_TMB.RData")
df2 = df %>% mutate(Tumor_Sample_Barcode = substr(Tumor_Sample_Barcode, 1, 12)) %>%
arrange(APM) %>% distinct(Tumor_Sample_Barcode, .keep_all = TRUE)
tcga_all = full_join(df2, TCGA_TMB, by="Tumor_Sample_Barcode")
if(!file.exists("results/TCGA_ALL.RData")) {
save(tcga_all, file = "results/TCGA_ALL.RData")
}
rm(list = ls())Tumor mutation burden (TMB) is defined as the number of non-synonymous alterations per megabase (Mb) of genome examined. As reported previously, here we use 38 Mb as the estimate of the exome size. For studies reporting mutation number from whole exome sequencing, the normalized TMB = (whole exome non-synonymous mutation)/(38 Mb).
Original APM scores (APS) from GSVA are in the range of -1 to 1. To calculate tumor immunogenicity score (TIGS), original APM score from GSVA implementation is rescaled by minimal and maximal APM score from TCGA Pan-cancer analysis.
We calculate tumor immunogenicity score (TIGS) as following:
\[ TIGS = APS_{normalized} \times log(TMB + 1) \]
Natural logarithm is applied here. Of note, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB.
How we generate this TIGS formula will be described at an individual part. Here we focus on association of APS, TMB and TIGS and their effects.
load("results/TCGA_ALL.RData")
tcga_all = tcga_all %>%
mutate(nAPM = (APM - min(APM, na.rm = TRUE))/ (max(APM, na.rm = TRUE) - min(APM, na.rm = TRUE)),
nTMB = TMB_NonsynVariants / 38,
TIGS = log(nTMB+1) * nAPM) %>%
rename(Event = OS, Time = OS.time)
# keep samples with survival information
df_os = tcga_all %>%
filter(!is.na(Time), !is.na(Event))
df_os %>% filter(!is.na(nAPM)) %>%
group_by(Project) %>% summarise(N=n()) %>% arrange(N)
#> # A tibble: 32 x 2
#> Project N
#> <chr> <int>
#> 1 CHOL 36
#> 2 DLBC 47
#> 3 UCS 56
#> 4 KICH 65
#> 5 ACC 79
#> 6 UVM 80
#> 7 MESO 85
#> 8 READ 92
#> 9 SKCM 102
#> 10 THYM 119
#> # ... with 22 more rows
df_os %>% filter(!is.na(TIGS)) %>%
group_by(Project) %>% summarise(N=n()) %>% arrange(N)
#> # A tibble: 32 x 2
#> Project N
#> <chr> <int>
#> 1 CHOL 36
#> 2 DLBC 36
#> 3 UCS 56
#> 4 KICH 65
#> 5 ACC 79
#> 6 MESO 80
#> 7 UVM 80
#> 8 READ 89
#> 9 SKCM 102
#> 10 THYM 118
#> # ... with 22 more rowsDistribution of APS, TMB and IIS across TCGA studies
Calculate median values to sort distribution.
#------------ TIGS, TMB, APM pancan, sort by value
df_summary = df_os %>%
group_by(Project) %>%
summarise(medianAPM = median(APM, na.rm = TRUE),
medianTMB = median(TMB_NonsynVariants, na.rm = TRUE),
medianTIGS = median(TIGS, na.rm = TRUE),
medianAPMn = median(nAPM, na.rm = TRUE),
medianTMBn = log(median(nTMB, na.rm = TRUE) +1 ))Distribution of APM score (APS) across TCGA studies.
library(scales)
library(ggpubr)
df_os %>% filter(!is.na(Project), !is.na(APM)) %>%
ggboxplot(x="Project", y="APM", color="Project", add="jitter", xlab = "TCGA Projects",
ylab = "APM Score", add.params = list(size=0.6),
legend = "none") +
rotate_x_text(angle = 45) +
geom_hline(yintercept = mean(df_os$APM, na.rm=TRUE), linetype=2) +
scale_x_discrete(limits = arrange(df_summary, medianAPM) %>% .$Project) -> p_apm
p_apmDistribution of TMB across TCGA studies.
df_os %>% filter(!is.na(Project), !is.na(TMB_NonsynVariants)) %>%
ggboxplot(x="Project", y="TMB_NonsynVariants", color="Project", add="jitter", xlab = "TCGA Projects",
ylab = "No. of Coding Somatic Nonsynonymous Mutation", add.params = list(size=0.6),
legend = "none") +
rotate_x_text(angle = 45) +
geom_hline(yintercept = mean(df_os$TMB_NonsynVariants, na.rm=TRUE), linetype=2) +
scale_y_log10(breaks= 10^(-1:4), labels = trans_format("log10", math_format(10^.x))) +
scale_x_discrete(limits = arrange(df_summary, medianTMB) %>% .$Project) -> p_tmb
p_tmbDistribution of TIGS across TCGA studies.
df_os %>% filter(!is.na(Project), !is.na(TIGS)) %>%
ggboxplot(x="Project", y="TIGS", color="Project", add="jitter", xlab = "TCGA Projects",
ylab = "TIGS", add.params = list(size=0.6),
legend = "none") +
rotate_x_text(angle = 45) +
geom_hline(yintercept = mean(df_os$TIGS, na.rm=TRUE), linetype=2) +
scale_x_discrete(limits = arrange(df_summary, medianTIGS) %>% .$Project) -> p_tigs
p_tigsCorrelation between TMB and IIS
df_TMB = tcga_all %>% filter(!is.na(IIS), !is.na(TMB_NonsynVariants)) %>%
mutate(logTMB = log(nTMB + 1))
ggstatsplot::ggscatterstats(
data = df_TMB,
x = nTMB,
y = IIS,
xlab = "No. of Coding Somatic Nonsynonymous Mutation",
ylab = "IIS Score",
# title = "Correlation between APM and IIS score in pancancer",
messages = FALSE, type = "spearman"
)
ggstatsplot::ggscatterstats(
data = df_TMB,
x = logTMB,
y = IIS,
xlab = "No. of Coding Somatic Nonsynonymous Mutation (log)",
ylab = "IIS Score",
# title = "Correlation between APM and IIS score in pancancer",
messages = FALSE, type = "spearman"
)
plot_scatter = function(data, x, y, xlab = "Median APM", ylab = "Median IIS", conf.int=TRUE, method="spearman", label.x=-0.5, label.y=0.2, label="Project", ...){
ggscatter(data, x=x, y=y,
xlab = xlab, ylab = ylab,
shape = 21, size = 3, color = "black",
add = "reg.line", add.params = list(color = "blue", fill = "lightgray"),
conf.int = conf.int,
cor.coef = TRUE,
cor.coeff.args = list(method = method, label.x = label.x, label.y=label.y, label.sep = "\n"),
label=label, repel = TRUE, ...)
}
df_project = df_TMB %>%
group_by(Project) %>%
summarise(TMB_median = median(nTMB, na.rm = TRUE),
IIS_median = median(IIS, na.rm = TRUE))
mean(df_TMB$nTMB)
#> [1] 4.350039
plot_scatter(df_project, x="TMB_median", y="IIS_median",
xlab = "Median TMB", ylab = "Median IIS", label.x = 0.2) +
geom_hline(yintercept = 0, linetype=2) + geom_vline(xintercept = 4.35, linetype = 2)Significant correlation between APM score and IIS
ggstatsplot::ggscatterstats(
data = tcga_all %>% filter(!is.na(APM)),
x = APM,
y = IIS,
xlab = "APM Score",
ylab = "IIS Score",
# title = "Correlation between APM and IIS score in pancancer",
messages = FALSE, type = "spearman"
)
df_project2 = tcga_all %>%
filter(!is.na(APM)) %>%
group_by(Project) %>%
summarise(APM_median = median(APM),
IIS_median = median(IIS))
plot_scatter(df_project2, x="APM_median", y="IIS_median",
xlab = "Median APM", ylab = "Median IIS") +
geom_hline(yintercept = 0, linetype=2) + geom_vline(xintercept = 0, linetype = 2)Survival analysis
We access survival effects of APS, TMB and TIGS across TCGA studies using unvariable cox model.
Firstly, we implement cox model and then obtain key result values from fit, i.e. p value and corresponding 95% confident interval.
library(survival)
# calculate APM cox model by project
model_APM = df_os %>%
filter(!is.na(nAPM)) %>%
group_by(Project) %>%
dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ nAPM, data = .)) %>%
summarise(Project = Project,
Coef = summary(coxfit)$conf.int[1],
Lower = summary(coxfit)$conf.int[3],
Upper = summary(coxfit)$conf.int[4],
Pvalue = summary(coxfit)$logtest[3])
# use nTMB or log(nTMB)
model_TMB = df_os %>%
filter(!is.na(nTMB)) %>%
group_by(Project) %>%
dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ log(nTMB), data = .)) %>%
summarise(Project = Project,
Coef = summary(coxfit)$conf.int[1],
Lower = summary(coxfit)$conf.int[3],
Upper = summary(coxfit)$conf.int[4],
Pvalue = summary(coxfit)$logtest[3])
model_TIGS = df_os %>%
filter(!is.na(TIGS)) %>%
group_by(Project) %>%
dplyr::do(coxfit = coxph(Surv(time = Time, event = Event) ~ TIGS, data = .)) %>%
summarise(Project = Project,
Coef = summary(coxfit)$conf.int[1],
Lower = summary(coxfit)$conf.int[3],
Upper = summary(coxfit)$conf.int[4],
Pvalue = summary(coxfit)$logtest[3])
N_APM = df_os %>% filter(!is.na(nAPM)) %>%
group_by(Project) %>% summarise(N=n())
N_TMB = df_os %>% filter(!is.na(nTMB)) %>%
group_by(Project) %>% summarise(N=n())
N_TIGS = df_os %>% filter(!is.na(TIGS)) %>%
group_by(Project) %>% summarise(N=n())
cox_APM = full_join(model_APM, N_APM)
#> Joining, by = "Project"
cox_TMB = full_join(model_TMB, N_TMB)
#> Joining, by = "Project"
cox_TIGS = full_join(model_TIGS, N_TIGS)
#> Joining, by = "Project"Secondly, we generate forest plot.
library(forestplot)
#> Loading required package: grid
#> Loading required package: checkmate
########### Forest plot
#------ APM
options(digits = 2)
forest_APM = rbind(c("Project", NA, NA, NA, "p.value", "No."),
cox_APM) %>% as.data.frame()
forest_APM$HR = c("HR", format(as.numeric(forest_APM$Coef[-1]), digits = 2))
forest_APM$Pvalue = c("p.value", format(as.numeric(forest_APM$Pvalue[-1]), digits = 2))
forestplot(fn.ci_norm = fpDrawCircleCI,
forest_APM[,c("Project", "N", "HR", "Pvalue")],
mean = c(NA, log(cox_APM$Coef)), lower = c(NA, log(cox_APM$Lower)), upper = c(NA, log(cox_APM$Upper)),
is.summary = c(TRUE, rep(FALSE, 32)),
clip = c(-4, 4), zero = 0,
col=fpColors(box="royalblue",line="black", summary="royalblue", hrz_lines = "black"),
vertices = TRUE,
xticks = c(-4, -2, -1, 0, 1, 2, 4),
hrzl_lines = list("2" = gpar(lty=1, col = "black")),
boxsize = 0.5,
# graph.pos = 3,
xlab = "log Hazard Ratio")
#----- TMB
forest_TMB = rbind(c("Project", NA, NA, NA, "p.value", "No."),
cox_TMB) %>% as.data.frame()
forest_TMB$HR = c("HR", format(as.numeric(forest_TMB$Coef[-1]), digits = 2))
forest_TMB$Pvalue = c("p.value", format(as.numeric(forest_TMB$Pvalue[-1]), digits = 2))
forestplot(fn.ci_norm = fpDrawCircleCI,
forest_TMB[,c("Project", "N", "HR", "Pvalue")],
mean = c(NA, log(cox_TMB$Coef)), lower = c(NA, log(cox_TMB$Lower)), upper = c(NA, log(cox_TMB$Upper)),
is.summary = c(TRUE, rep(FALSE, 32)),
clip = c(-4, 4), zero = 0,
col=fpColors(box="royalblue",line="black", summary="royalblue", hrz_lines = "black"),
vertices = TRUE,
xticks = c(-4, -2, -1, 0, 1, 2, 4),
hrzl_lines = list("2" = gpar(lty=1, col = "black")),
boxsize = 0.5,
# graph.pos = 3,
xlab = "log Hazard Ratio")
#---------- TIGS
forest_TIGS = rbind(c("Project", NA, NA, NA, "p.value", "No."),
cox_TIGS) %>% as.data.frame()
forest_TIGS$HR = c("HR", format(as.numeric(forest_TIGS$Coef[-1]), digits = 2))
forest_TIGS$Pvalue = c("p.value", format(as.numeric(forest_TIGS$Pvalue[-1]), digits = 2))
forestplot(fn.ci_norm = fpDrawCircleCI,
forest_TIGS[,c("Project", "N", "HR", "Pvalue")],
mean = c(NA, log(cox_TIGS$Coef)), lower = c(NA, log(cox_TIGS$Lower)), upper = c(NA, log(cox_TIGS$Upper)),
is.summary = c(TRUE, rep(FALSE, 32)),
clip = c(-4, 4), zero = 0,
col=fpColors(box="royalblue",line="black", summary="royalblue", hrz_lines = "black"),
vertices = TRUE,
xticks = c(-4, -2, -1, 0, 1, 2, 4),
hrzl_lines = list("2" = gpar(lty=1, col = "black")),
boxsize = 0.5,
# graph.pos = 3,
xlab = "log Hazard Ratio")Gene sets enriched in patients with high APM score
To identify the specific gene sets/pathway associated with high APS, we firstly run differential gene expression analysis for each TCGA cancer type based on APS status. Patients with APS of first quartile was defined as “APS-High”, patients with APS of the forth quartile was defined as “APS-Low”. Genes with p value <0.01 and FDR <0.05 were ranked by logFC from top to bottom and then inputted into GSEA function of R package clusterProfiler with custom gene sets download from Molecular Signature Database v6.2. Normalized enrichment score (NES) was used to rank the differentially enriched gene sets. In results from hallmark gene sets, several gene signatures (especially interferon alpha/gamma response) were found to be enriched in most TCGA cancer types with high APS, suggesting high APS are strongly associated with interferon alpha/gamma signaling pathway.
The code below is runned on linux server (need much time), thus if reader wanna reproduce it, you may need to download gene sets from Molecular Signature Database v6.2 and modify some file path.
#----------------------------------------
# APM DEG and Pathway Enrichment Analysis
#----------------------------------------
library(tidyverse)
load("results/TCGA_RNASeq_PanCancer.RData")
load("results/gsva_tcga_pancan.RData")
load("results/TCGA_tidy_Clinical.RData")
df.gsva = full_join(TCGA_Clinical.tidy, gsva.pac, by=c("Tumor_Sample_Barcode"="tsb"))
tcga_info = df.gsva
rm(df.gsea, df.gsva); gc()
tcga_info = tcga_info %>%
select(Project:OS.time, Age, Gender, sample_type, Tumor_stage, APM) %>%
filter(!is.na(APM))
projects = unique(tcga_info[, "Project"])
samples = colnames(RNASeq_pancan)[-1]
#-----------------------------------
# Build a workflow to calculate DEGs
#-----------------------------------
findDEGs = function(info_df=NULL, expr_df=NULL, col_sample="Tumor_Sample_Barcode", col_group="APM",
col_subset = "Project", method="limma", threshold = 0.25,
save=FALSE, filename=NULL){
# method=c("DESeq2", "limma", "edgeR", "voom")
stopifnot(!is.null(info_df), !is.null(expr_df))
stopifnot(threshold >=0 & threshold <=1)
if(!require(data.table)){
install.packages("data.table", dependencies = TRUE)
}
info_df = setDT(info_df[, c(col_sample, col_group, col_subset)])
colnames(info_df) = c("sample", "groupV", "subset")
info_df = info_df[sample %in% colnames(expr_df)]
info_df = info_df[!is.na(subset)]
all_sets = unique(info_df[, subset])
if('DEGroup' %in% colnames(info_df)){
stop('DEGroup column exists, please rename and re-run.')
}
# set threshold
th1 = threshold
th2 = 1 - th1
info_df[, DEGroup:=ifelse(groupV<quantile(groupV, th1), "Low",
ifelse(groupV>quantile(groupV, th2), "High", NA)), by=subset]
info_df = info_df[!is.na(DEGroup)]
sets = unique(info_df[, subset])
may_del = setdiff(all_sets, sets)
if(length(may_del)!=0) {
message("Following groups has been filtered out because of threshold setting, you better check")
print(may_del)
}
info_list = lapply(sets, function(x) info_df[subset == x])
names(info_list) = sets
col1 = colnames(expr_df)[1]
options(digits = 4)
doDEG = function(input, method=NULL){
#-- prepare
exprSet = expr_df[, c(col1, input$sample)]
exprSet = as.data.frame(exprSet)
exprSet = na.omit(exprSet)
input = as.data.frame(input)
rownames(exprSet) = exprSet[,1]
exprSet = exprSet[, -1]
sample_tb = table(input$DEGroup)
#--- make sure have some samples
if(!length(sample_tb) < 2 & all(sample_tb >= 5)){
group_list = input$DEGroup
#-- make sure packages are installed
if(!all(require("DESeq2"), require("limma"), require("edgeR"))){
source("https://bioconductor.org/biocLite.R")
if(!require("DESeq2")) biocLite("DESeq2", dependencies = TRUE)
if(!require("limma")) biocLite("limma", dependencies = TRUE)
if(!require("edgeR")) biocLite("edgeR", dependencies = TRUE)
}
if("limma" %in% method){
suppressMessages(library(limma))
design=model.matrix(~0+factor(group_list))
colnames(design) = c('High', 'Low')
cont.matrix=makeContrasts('High-Low',levels = design)
fit=lmFit(exprSet,design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)
topTable(fit2, number = Inf, adjust.method ='BH')
}}
}
res = lapply(info_list, doDEG, method = method)
names(res) = sets
return(res)
}
DEG_pancan2 = findDEGs(info_df = tcga_info, expr_df = RNASeq_pancan)
save(DEG_pancan2, file = "results/DEG_pancan.RData")
#----------------------------------------------------
# Pathway enrichment analysis using clusterProfiler
#--------------------------------------------------
load(file="results/DEG_pancan.RData")
library(clusterProfiler)
library(tidyverse)
library(openxlsx)
#---------------
##- setting
pvalue = 0.01
adj.pvalue = 0.05
##- process
#------- Reading GSEA genesets files
hallmark = read.gmt("~/biodata/MsigDB/h.all.v6.2.symbols.gmt")
c1 = read.gmt("~/biodata/MsigDB/c1.all.v6.2.symbols.gmt")
c2_kegg = read.gmt("~/biodata/MsigDB/c2.cp.kegg.v6.2.symbols.gmt")
c2_reactome = read.gmt("~/biodata/MsigDB/c2.cp.reactome.v6.2.symbols.gmt")
c3 = read.gmt("~/biodata/MsigDB/c3.all.v6.2.symbols.gmt")
c4 = read.gmt("~/biodata/MsigDB/c4.all.v6.2.symbols.gmt")
c5_mf = read.gmt("~/biodata/MsigDB/c5.mf.v6.2.symbols.gmt")
c5_bp = read.gmt("~/biodata/MsigDB/c5.bp.v6.2.symbols.gmt")
c6 = read.gmt("~/biodata/MsigDB/c6.all.v6.2.symbols.gmt")
c7 = read.gmt("~/biodata/MsigDB/c7.all.v6.2.symbols.gmt")
#-=---------
goGSEA = function(DEG, prefix=NULL, pvalue=0.01, adj.pvalue=0.05, destdir="~/projects/APM/results/GSEA_results"){
library(clusterProfiler)
library(openxlsx)
library(tidyverse)
filterDEG = subset(DEG, subset = P.Value < pvalue & adj.P.Val < adj.pvalue)
filterDEG$SYMBOL = rownames(filterDEG)
filterDEG = filterDEG %>%
arrange(desc(logFC),adj.P.Val)
geneList = filterDEG$logFC
names(geneList) = filterDEG$SYMBOL
res = list()
if (base::exists("hallmark")) res$hallmark = GSEA(geneList, TERM2GENE=hallmark, verbose=FALSE)
#if (base::exists("c1")) res$c1 = GSEA(geneList, TERM2GENE=c1, verbose=FALSE)
if (base::exists("c2_kegg")) res$c2_kegg = GSEA(geneList, TERM2GENE=c2_kegg, verbose=FALSE)
if (base::exists("c2_reactome")) res$c2_reactome = GSEA(geneList, TERM2GENE=c2_reactome, verbose=FALSE)
#if (base::exists("c3")) res$c3 = GSEA(geneList, TERM2GENE=c3, verbose=FALSE)
#if (base::exists("c4")) res$c4 = GSEA(geneList, TERM2GENE=c4, verbose=FALSE)
if (base::exists("c5_mf")) res$c5_mf = GSEA(geneList, TERM2GENE=c5_mf, verbose=FALSE)
if (base::exists("c5_bp")) res$c5_bp = GSEA(geneList, TERM2GENE=c5_bp, verbose=FALSE)
if (base::exists("c6")) res$c6 = GSEA(geneList, TERM2GENE=c6, verbose=FALSE)
if (base::exists("c7")) res$c7 = GSEA(geneList, TERM2GENE=c7, verbose=FALSE)
getResultList = lapply(res, function(x) x@result)
if(!dir.exists(destdir)) dir.create(destdir)
outpath = file.path(destdir, prefix)
write.xlsx(x = getResultList, file = paste0(outpath,".xlsx"))
return(res)
}
# remove STAD, CHOL, KICH and DLBC
DEG_pancan2 = DEG_pancan2[setdiff(names(DEG_pancan2),c("STAD", "CHOL", "KICH", "DLBC"))]
GSEA_list = Map(goGSEA, DEG_pancan2, names(DEG_pancan2))
save(GSEA_list, file = "results/GSEA_results_rm_notEnriched.RData")
#------- GSEA example plot
# load("results/GSEA_results.RData")
load("results/GSEA_results_rm_notEnriched.RData")
library(clusterProfiler)
gseaplot(GSEA_list$SKCM$hallmark, geneSetID = "HALLMARK_INTERFERON_GAMMA_RESPONSE")
# gseaplot(egmt2, geneSetID = "INTEGRAL_TO_PLASMA_MEMBRANE")
# gseaplot(res$hallmark, geneSetID = "HALLMARK_INTERFERON_GAMMA_RESPONSE")
#------ get main result cloumn
gsea_results = lapply(GSEA_list, function(x) {
#x@result %>% select()
lapply(x, function(gsea) {
gsea@result %>% dplyr::select(ID, setSize, enrichmentScore, NES, pvalue, qvalues)
})
})
save(gsea_results, file = "results/GSEA_results_notEnriched_small.RData")
rm(GSEA_list); gc()
#----- plot
summariseGSEA = function(pathway = NULL){
purrr::reduce(Map(function(x, y, pathway){
if(nrow(x[[pathway]]) == 0){
res = NULL
}else{
res = x[[pathway]]
res$Project = y
}
res
}, gsea_results, names(gsea_results), pathway), rbind)
}
gsea_summary = list()
gsea_summary$hallmark = summariseGSEA(pathway = "hallmark")
gsea_summary$c2_kegg = summariseGSEA(pathway = "c2_kegg")
gsea_summary$c2_reactome = summariseGSEA(pathway = "c2_reactome")
gsea_summary$c5_mf = summariseGSEA(pathway = "c5_mf")
gsea_summary$c5_bp = summariseGSEA(pathway = "c5_bp")
gsea_summary$c6 = summariseGSEA(pathway = "c6")
gsea_summary$c7 = summariseGSEA(pathway = "c7")
save(gsea_summary, file ="results/GSEA_summary_notEnriched.RData")
## plot really
load("data/df_combine_gsva_clinical.RData")
df.gsva %>%
filter(!is.na(APM)) %>%
group_by(Project) %>%
summarize(MedianAPM = median(APM), N=n()) %>%
arrange(MedianAPM) -> sortAPM
sortAPM %>% filter(Project != "DLBC") -> sortAPM
library(scales)
gsea_summary$hallmark %>%
ggplot(mapping = aes(x = Project, y = ID)) +
geom_tile(aes(fill = NES)) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red") +
scale_x_discrete(limits = sortAPM$Project) + theme_classic() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))
# df_wide = reshape2::dcast(gsea_summary$hallmark, ID ~ Project, value.var="NES", fill = 0)
library(pheatmap)
plotPathway = function(df, save=FALSE, path=NULL, silent=FALSE){
df_wide = reshape2::dcast(df, ID ~ Project, value.var="NES", fill = 0)
breaksList = seq(-max(df_wide[,-1], na.rm = TRUE), max(df_wide[, -1], na.rm = TRUE), by = 0.01)
pheatmap(df_wide %>% column_to_rownames(var = "ID"),
color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
breaks = breaksList,
#annotation_col = annotation_col,
fontsize_row = 8, fontsize_col = 8, silent = silent,
cellheight = 10, cellwidth = 10, filename = ifelse(save==FALSE, NA, path))
}
gg = plotPathway(gsea_summary$hallmark, silent = F, path = "results/GSEA_results_plot/Hallmark_Enriched.pdf", save = TRUE)
gg = plotPathway(gsea_summary$c2_kegg, silent = F, path = "results/GSEA_results_plot/KEGG_Enriched.pdf", save = TRUE)
gg = plotPathway(gsea_summary$c2_reactome, silent = F, path = "results/GSEA_results_plot/Reactome_Enriched.pdf", save = TRUE)
gg = plotPathway(gsea_summary$c5_mf, silent = F, path = "results/GSEA_results_plot/GO_MolecuFunction_Enriched.pdf", save = TRUE)
gg = plotPathway(gsea_summary$c5_bp, silent = F, path = "results/GSEA_results_plot/GO_BiologicalProcess_Enriched.pdf", save = TRUE)
gg = plotPathway(gsea_summary$c6, silent = F, path = "results/GSEA_results_plot/C6_oncogenicsignatures_Enriched.pdf", save = TRUE)
gg = plotPathway(gsea_summary$c7, silent = F, path = "results/GSEA_results_plot/C7_immunologicsignatures_Enriched.pdf", save = TRUE)
gg
#_-------- rank pheatmap by NES value
plotPathway2 = function(df, save=FALSE, path=NULL, silent=FALSE){
df_wide = reshape2::dcast(df, ID ~ Project, value.var="NES", fill = 0)
df_wide = df_wide[order(rowMeans(df_wide[,-1]), decreasing = TRUE),]
rownames(df_wide) = NULL
breaksList = seq(-max(df_wide[,-1], na.rm = TRUE), max(df_wide[, -1], na.rm = TRUE), by = 0.01)
pheatmap(df_wide %>% column_to_rownames(var = "ID"),
color = colorRampPalette(c("blue", "white", "red"))(length(breaksList)),
breaks = breaksList, cluster_rows = FALSE,
#annotation_col = annotation_col,
fontsize_row = 8, fontsize_col = 8, silent = silent,
cellheight = 10, cellwidth = 10, filename = ifelse(save==FALSE, NA, path))
}
gg = plotPathway2(gsea_summary$c2_reactome, silent = F, path = "results/GSEA_results_plot/Reactome_Enriched2.pdf", save = TRUE)Session Info
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#>
#> attached base packages:
#> [1] grid stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] forestplot_1.7.2 checkmate_1.8.5 survival_2.43-1 ggpubr_0.1.9
#> [5] magrittr_1.5 scales_1.0.0 corrplot_0.84 pheatmap_1.0.10
#> [9] bindrcpp_0.2.2 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8
#> [13] purrr_0.2.5 readr_1.1.1 tidyr_0.8.2 tibble_1.4.2
#> [17] ggplot2_3.1.0 tidyverse_1.2.1 pacman_0.5.0
#>
#> loaded via a namespace (and not attached):
#> [1] jmvcore_0.9.5 TH.data_1.0-9 effsize_0.7.1
#> [4] minqa_1.2.4 colorspace_1.3-2 rjson_0.2.20
#> [7] ggridges_0.5.1 modeltools_0.2-22 sjlabelled_1.0.14
#> [10] rprojroot_1.3-2 snakecase_0.9.2 estimability_1.3
#> [13] mc2d_0.1-18 rstudioapi_0.8 glmmTMB_0.2.2.0
#> [16] ggrepel_0.8.0 DT_0.5 fansi_0.4.0
#> [19] mvtnorm_1.0-8 lubridate_1.7.4 coin_1.2-2
#> [22] xml2_1.2.0 codetools_0.2-15 splines_3.5.1
#> [25] mnormt_1.5-5 knitr_1.20 sjmisc_2.7.6
#> [28] bayesplot_1.6.0 jsonlite_1.5 nloptr_1.2.1
#> [31] broom_0.5.0 broom.mixed_0.2.3 shiny_1.2.0
#> [34] compiler_3.5.1 httr_1.3.1 emmeans_1.3.0
#> [37] sjstats_0.17.1 backports_1.1.2 ggcorrplot_0.1.2
#> [40] assertthat_0.2.0 Matrix_1.2-15 lazyeval_0.2.1.9000
#> [43] cli_1.0.1 later_0.7.5 htmltools_0.3.6
#> [46] tools_3.5.1 ggstatsplot_0.0.6 coda_0.19-2
#> [49] gtable_0.2.0 glue_1.3.0 reshape2_1.4.3
#> [52] Rcpp_1.0.0 cellranger_1.1.0 nlme_3.1-137
#> [55] crosstalk_1.0.0 psych_1.8.10 exactci_1.3-3
#> [58] xfun_0.4 lme4_1.1-19 rvest_0.3.2
#> [61] mime_0.6 miniUI_0.1.1.1 exact2x2_1.6.3
#> [64] stringdist_0.9.5.1 MASS_7.3-51.1 zoo_1.8-4
#> [67] hms_0.4.2 promises_1.0.1 parallel_3.5.1
#> [70] sandwich_2.5-0 pwr_1.2-2 TMB_1.7.15
#> [73] RColorBrewer_1.1-2 yaml_2.2.0 purrrlyr_0.0.3
#> [ reached getOption("max.print") -- omitted 40 entries ]